Abstract: In this paper we consider the use of variable
length non prefix-free codes for coding constrained sequences of
symbols. We assume a Markov source in which some state
transitions are impossible, i.e. the stochastic matrix associated
with the Markov chain has some null entries. We show that the
classic Kraft inequality is not, in general, a necessary condition
for unique decodability under this hypothesis, and we propose
a relaxed necessary inequality condition. This allows, in some
cases, the use of non prefix-free codes that can give very good
performance, both in terms of compression and computational
efficiency. Some considerations are made on the relation between
the proposed approach and other existing coding paradigms.

I. Introduction 

Variable length codes are usually required to satisfy the
Kraft inequality ([1]), since it was proved in [2] that this is
a necessary condition for unique decodability. Unique
decodability is always defined as a property of the set
of codewords alone, with no reference to the type of source.
Thus, by imposing unique decodability we require that
every sequence of symbols be distinguishable. This
is perfectly acceptable if the source can produce
any sequence of symbols with non-null probability as, for
example, in the memoryless source case. But what happens
if the source is not memoryless and, more precisely, not all
sequences are allowed? Consider for example a source with
three symbols A, B and C, and suppose that symbol A can
never be followed by B. This is a characteristic of the
source that we can consider known for the encoding problem,
as we actually do in the classic Shannon paradigm where no
bits are assigned to null-probability events. The decoder
is thus usually supposed to know that an event is not possible,
at least from the fact that no codeword is assigned
to that event. Now suppose we use the codewords '0', '1'
and '01' for our source symbols A, B and C respectively.
It is easy to see that any sequence of symbols that can be
generated by the source is uniquely specified by concatenating
the codewords of its symbols. This is because with
this code one may only confuse a sequence of symbols with
sequences that our source cannot generate: the only possible
confusion is between C and AB, and AB is never emitted.
The objective of this paper is to propose the notion of unique
decodability of a set of codewords relative to a specific
information source. We show simple examples of Markov sources
that can be efficiently encoded by non prefix-free codes. Our
examples also show that there are sources for which a coding
technique exists such that the expected number of bits of the
code for the first n symbols of the source is strictly smaller
than the entropy of those first n symbols (for every n). This
is often erroneously considered impossible, and the error arises
from considering the Kraft inequality a necessary condition
for every type of source.
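
The claim about the codewords '0', '1' and '01' can be checked by brute force for short sequences. The following is only a sketch: the helper names (`allowed`, `encode`, `collisions`) and the length bound are illustrative choices, not part of the paper, and an exhaustive search up to a fixed length is a sanity check rather than a proof.

```python
from itertools import product

CODE = {"A": "0", "B": "1", "C": "01"}   # codewords from the text
FORBIDDEN = {("A", "B")}                  # A is never followed by B


def allowed(seq):
    """True if no forbidden pair occurs in consecutive positions."""
    return all((x, y) not in FORBIDDEN for x, y in zip(seq, seq[1:]))


def encode(seq):
    """Concatenate the codewords of the symbols in seq."""
    return "".join(CODE[s] for s in seq)


def collisions(max_len, constrained=True):
    """Pairs of distinct source sequences (length <= max_len) with equal encodings."""
    seen = {}
    clashes = []
    for n in range(1, max_len + 1):
        for seq in product(CODE, repeat=n):
            if constrained and not allowed(seq):
                continue
            bits = encode(seq)
            if bits in seen:
                clashes.append((seen[bits], "".join(seq)))
            else:
                seen[bits] = "".join(seq)
    return clashes


print(collisions(8))                     # []: no clash among allowed sequences
print(collisions(2, constrained=False))  # [('C', 'AB')]: the clash the constraint removes
```

Without the constraint the very first ambiguity found is between C and AB, which is exactly the pair the source can never produce.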

II. A curious Markov chain example 

Suppose, so as to analyze more thoroughly the example
of the introduction, we have a source generating symbols
$X_1, X_2, X_3, \ldots$ extracted from the set X = {A, B, C}
following the Markov chain graphically shown in figure 1. In
formulae, if we call $S_n$ the probability distribution row vector
on X at step n, we have $S_{n+1} = S_n P$, where P is the transition
probability matrix collecting the probabilities of figure 1
(in particular $P_{AB} = 0$).
If we suppose the initial state has uniform probability $S_1 =
[1/3, 1/3, 1/3]$, it is easy to verify that the process is stationary,
i.e. $S_i = S_1$ for every i. Thus, the entropy $H(X_1, X_2, \ldots, X_k)$
of the first k symbols can be easily computed. We have

\[
\begin{aligned}
H(X_1, X_2, \ldots, X_k) &= H(X_1) + H(X_2|X_1) + \cdots + H(X_k|X_{k-1}) \\
 &= H(X_1) + H(X_2|X_1)(k-1) = \log(3) + \frac{4}{3}(k-1).
\end{aligned}
\]



Let us consider now the codeword assignment A → 0, B → 1,
C → 01. This code is clearly not prefix-free but, as explained
in the introduction, when used for this source no ambiguity
can arise. If l(X) is the length of the codeword associated with X,
we can compute the expected number of bits used in coding the
first k symbols as

\[
E[l(X_1 X_2 X_3 \cdots X_k)] = E\left[\sum_{i=1}^{k} l(X_i)\right]
 = \sum_{i=1}^{k} E[l(X_i)]
 = k\left(\frac{1}{3} + \frac{1}{3} + \frac{2}{3}\right) = \frac{4}{3}k.
\]

Fig. 1. Graph, with transition probabilities, for a Markov Chain.



It is easy to verify that this value is strictly smaller than the
entropy $H(X_1, X_2, \ldots, X_k)$ computed above. What happens
with our coding procedure? And what happens when the
number of symbols goes to infinity? Looking carefully at our
example, we note that our coding strategy uses an expected
number of 4/3 bits for coding the first symbol, while its
entropy is log 3. For the following symbols, in turn, the
entropies $H(X_2|X_1), H(X_3|X_2), \ldots$ equal 4/3 bits, and thus
they have exactly the same value as the number of bits used
by our code. So, our code only gains on the first symbol,
and the gain is not substantial (as is obvious from the AEP
for ergodic sources). But this fact is nonetheless interesting: our
code assigns to the first symbol a number of bits smaller than
its entropy, using the memory properties of the source, without
affecting unique decodability. Thus, when only a finite number
of symbols is given, the problem of finding the optimal coding
strategy arises. Furthermore, we should consider that in the
more general case, when higher order constraints may be
present, the problem becomes much more intriguing.
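
The comparison between entropy and expected code length can be reproduced numerically. The transition matrix used below is an assumption: it is one matrix consistent with the properties stated in this section (no transition from A to B, uniform stationary distribution, conditional entropy of 4/3 bits) and is not claimed to be the matrix of figure 1. Function names are illustrative.

```python
from math import log2

# Hypothetical transition matrix (rows and columns ordered A, B, C), chosen only
# to match the stated properties: P[A][B] = 0, uniform stationary distribution,
# H(X_2 | X_1) = 4/3 bits. It is an assumption, not taken from the paper.
P = [[0.50, 0.0, 0.50],
     [0.25, 0.5, 0.25],
     [0.25, 0.5, 0.25]]
S1 = [1 / 3, 1 / 3, 1 / 3]   # uniform initial distribution
LENGTHS = [1, 2 - 1, 2]      # lengths of the codewords 0, 1 and 01


def entropy(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * log2(p) for p in dist if p > 0)


def step(s):
    """One step of the chain: row vector s times P."""
    return [sum(s[i] * P[i][j] for i in range(3)) for j in range(3)]


def joint_entropy(k):
    """H(X_1, ..., X_k) = H(X_1) + (k - 1) H(X_2 | X_1) for the stationary chain."""
    conditional = sum(S1[i] * entropy(P[i]) for i in range(3))
    return entropy(S1) + (k - 1) * conditional


def expected_length(k):
    """Expected number of code bits for the first k symbols."""
    return k * sum(S1[i] * LENGTHS[i] for i in range(3))


for k in (1, 2, 10):
    print(k, round(joint_entropy(k), 4), round(expected_length(k), 4))
```

Under this assumed matrix the gap between the two quantities is the same for every k and equals log 3 - 4/3 bits, i.e. the saving made on the first symbol.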

III. Kraft inequality for constrained sequences 

McMillan showed in [2] that a necessary condition for the
unique decodability of a set of n codewords is that their
lengths $l_1, l_2, \ldots, l_n$ satisfy the Kraft inequality

\[
\sum_{i=1}^{n} 2^{-l_i} \le 1.
\]

Karush ([3]) gave a simple proof of this fact by considering
that for every k > 0 the following inequality must be satisfied

\[
\left( \sum_{i=1}^{n} 2^{-l_i} \right)^{k} \le k(l_{\max} - l_{\min}) + 1,
\]

where $l_{\min}$ and $l_{\max}$ are the minimum and the maximum of the
lengths. The term on the left hand side of this inequality can be
written as the sum of the weights of the codes of all possible
sequences of k symbols. For example, a sequence starting with
$x_1, x_3, x_2, \ldots$ gives a term $2^{-l_1} 2^{-l_3} 2^{-l_2} \cdots$
in the expansion of the left hand side. In order to have only one
sequence assigned to every code the above inequality is necessary.
But if we only want to distinguish between the possible sequences
generated by a constrained source, we may rewrite the condition
in a less stringent form. Let us consider once more as an example
the source of fig. 1, with lengths $l_1$, $l_2$ and $l_3$ assigned
respectively to A, B and C. Terms in the expansion of the left hand
side that contain $\cdots 2^{-l_1} 2^{-l_2} \cdots$ should not be
considered, as B cannot follow A in a source sequence. Let us
consider the matrix

\[
Q = \begin{bmatrix}
2^{-l_1} & 0 & 2^{-l_1} \\
2^{-l_2} & 2^{-l_2} & 2^{-l_2} \\
2^{-l_3} & 2^{-l_3} & 2^{-l_3}
\end{bmatrix};
\]

it is possible to verify that the really necessary counterpart
of Karush's inequality for our source should be written, for k > 0, as

\[
[\,1\ 1\ 1\,]\; Q^{k-1} \begin{bmatrix} 2^{-l_1} \\ 2^{-l_2} \\ 2^{-l_3} \end{bmatrix}
\le k(l_{\max} - l_{\min}) + 1.
\]

It is possible to show that a necessary condition for this
inequality to be satisfied for every k is that the matrix Q
has spectral radius$^1$ at most equal to 1. We state and prove
this fact in the general case.

Theorem 1: Let P be an irreducible n x n stochastic matrix
and $\mathbf{l} = [l_1, l_2, \ldots, l_n]$ a vector of n integers.
Let Q be the n x n matrix such that

\[
Q_{ij} = \begin{cases}
0 & \text{if } P_{ij} = 0 \\
2^{-l_i} & \text{if } P_{ij} > 0
\end{cases} \tag{6}
\]

Then, a necessary condition for the codeword lengths $l_1, l_2,
\ldots, l_n$ to be lengths of a uniquely decodable code for a Markov
source with transition probability matrix P is that $\rho(Q) \le 1$,
where $\rho(Q)$ is the spectral radius of Q.

Proof: We follow Karush's proof of McMillan's theorem.
Suppose without loss of generality that the set of our source
symbols is X = {1, 2, ..., n}, and call $X^{(k)}$ the set of all
sequences of k symbols that can be produced by the source.
Let us set $L = 2^{-\mathbf{l}} = [2^{-l_1}, 2^{-l_2}, \ldots, 2^{-l_n}]$
and define, for k > 0,

\[
V_k^T = Q^{k-1} L^T. \tag{7}
\]

Then it is easy to see by induction that the i-th component of
$V_k$ is written as

\[
V_k^i = \sum 2^{-l_{x_1} - l_{x_2} - \cdots - l_{x_k}} \tag{8}
\]

where the sum runs over all elements $(x_1, x_2, \ldots, x_k)$ of
$X^{(k)}$ with $x_1 = i$. So, if we call $1_n$ the row vector composed
of n 1's, we have

\[
1_n Q^{k-1} L^T = \sum_{x_1, x_2, \ldots, x_k} 2^{-l_{x_1} - l_{x_2} - \cdots - l_{x_k}} \tag{9}
\]

where the sum now runs over all elements of $X^{(k)}$. Thus,
reindexing the sum with respect to the total length $l = l_{x_1} +
l_{x_2} + \cdots + l_{x_k}$ and calling $c_l$ the number of sequences
of $X^{(k)}$ to which corresponds a code of length l, we have

\[
1_n Q^{k-1} L^T = \sum_{l = k l_{\min}}^{k l_{\max}} c_l 2^{-l} \tag{10}
\]

where $l_{\min}$ and $l_{\max}$ are respectively the minimum and the
maximum of the values $l_i$, i = 1, 2, ..., n. Since the code
is uniquely decodable, all the $c_l$ sequences of length l must be
different, and so they are at most $2^l$. This implies that, for
every k > 0, we must have

\[
1_n Q^{k-1} L^T \le \sum_{l = k l_{\min}}^{k l_{\max}} 2^{l} 2^{-l}
 = k(l_{\max} - l_{\min}) + 1. \tag{11}
\]

Now, note that the irreducible matrix Q is also nonnegative.
Thus, by the Perron-Frobenius theorem (see [4] for details),
its spectral radius $\rho(Q)$ is also an eigenvalue$^2$, with algebraic
multiplicity 1 and with positive associated eigenvector. As the
vectors $1_n$ and L are both positive, this implies that the term
on the left hand side of eq. (11) asymptotically grows like
$\rho(Q)^k$. On the contrary, the right hand side term only grows
linearly with k; so, taking the limit as k → ∞ in eq. (11), we
conclude that $\rho(Q) \le 1$. ■

$^1$ The spectral radius of a matrix is defined as the greatest modulus of its
eigenvalues.

We note that if the P matrix has all strictly positive entries,
the matrix Q reduces to a matrix with all equal columns, and its
spectral radius is exactly $\sum_i 2^{-l_i}$. Thus, for non-constrained
sequences, we obtain the classic Kraft inequality. Furthermore,
as the spectral radius of an irreducible nonnegative matrix increases
if any of its elements increases, we note that the situation
$\rho(Q) = 1$ is a strong condition on P and $\mathbf{l}$, in the sense
that if for a given matrix P there is a decodable code with
codeword lengths $l_i$, i = 1, ..., n such that $\rho(Q) = 1$, then
there is no decodable code with lengths $l'_i \le l_i$ if $l'_i < l_i$
for some i. Also, it is not possible to remove constraints from
the Markov chain while keeping the unique decodability property.
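
The condition of Theorem 1 is easy to evaluate numerically. The sketch below builds Q for the three-symbol source of Section II with lengths [1, 1, 2] and estimates its spectral radius by power iteration on Q + I (the shift is an implementation choice made here so that the iteration also converges for periodic irreducible matrices). Function names and the iteration count are assumptions of this sketch.

```python
def build_q(lengths, allowed):
    """Q[i][j] = 2**-lengths[i] if the transition i -> j is allowed, else 0."""
    n = len(lengths)
    return [[2.0 ** -lengths[i] if allowed[i][j] else 0.0 for j in range(n)]
            for i in range(n)]


def spectral_radius(q, iters=500):
    """Estimate rho(Q) for a nonnegative irreducible Q by power iteration on Q + I."""
    n = len(q)
    v = [1.0 / n] * n                       # positive start vector with sum 1
    lam = 0.0
    for _ in range(iters):
        w = [v[i] + sum(q[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)                        # estimates rho(Q + I) because sum(v) == 1
        v = [x / lam for x in w]
    return lam - 1.0                        # rho(Q + I) = rho(Q) + 1


LENGTHS = [1, 1, 2]                         # codewords 0, 1, 01 for A, B, C
CONSTRAINED = [[1, 0, 1],                   # A cannot be followed by B
               [1, 1, 1],
               [1, 1, 1]]
UNCONSTRAINED = [[1, 1, 1]] * 3

print(spectral_radius(build_q(LENGTHS, CONSTRAINED)))    # close to 1
print(spectral_radius(build_q(LENGTHS, UNCONSTRAINED)))  # close to 5/4, the Kraft sum
```

For this example the eigenvalues of the constrained Q are 1, 1/4 and 0, so the code sits exactly on the boundary $\rho(Q) = 1$, while the same lengths give a Kraft sum of 5/4 once the constraint is removed.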

The most important remark, however, concerns the non
sufficiency of the stated condition. In fact, while the classic
Kraft inequality is a necessary and sufficient condition for the
existence of a uniquely decodable code for an unconstrained
sequence, the inequality $\rho(Q) \le 1$ found here is unfortunately
only necessary, and not sufficient. We discuss this point in the
next section, where we propose an extension of the
Sardinas-Patterson test for testing the unique decodability of
a code for a constrained sequence.

IV. Non sufficiency and Sardinas-Patterson test

In the preceding sections we have shown that the classic
Kraft inequality is not, in general, a necessary condition
for the unique decodability of a constrained sequence, and
we have found a necessary condition under this hypothesis.
Unfortunately, the condition found is not sufficient, and trivial
examples show this fact. We note that the only parameters
determining the matrix Q are the length vector $\mathbf{l}$ and the
graph associated with the Markov chain, i.e. the state pairs
with positive transition probability. Thus, we only consider the
transition graphs of the sources, without taking into account the
values of the transition probabilities. Consider for example a
source with three symbols A, B and C with transition graph
as shown in fig. 2(a). It is easy to see that if $\mathbf{l} = [1, 1, 1]$
then $\rho(Q) = 1$; however, it is clearly impossible to decode the
sequences of the source if we assign only one bit to every
symbol. In general, we may consider that it is not possible
to have a decodable code with more than $2^i$ codewords of
length i, because otherwise even the initial state cannot be
recovered. Yet even imposing this additional condition
does not suffice. Take for example a code with $\mathbf{l} = [1, 1, 2]$ for
a source with transition graph as shown in fig. 2(b): we have
$\rho(Q) \le 1$, only two codewords of 1 bit and one codeword of
2 bits, but still a decodable code with those lengths does not
exist (A → 0 imposes B → 1, and consequently C → 11, but
then BCB and CC have the same code).

$^2$ Note that in general the spectral radius is not an eigenvalue, as it is defined
as the maximum of $|\lambda|$ over all eigenvalues $\lambda$.

Fig. 2. Two examples of transition graphs for which $\rho(Q) \le 1$ is not a
sufficient condition.

The above examples show that the question of finding a
sufficient condition for the unique decodability of codes for
constrained sequences appears to be more complicated than
with unconstrained sequences. A positive fact is that it is
possible to extend the Sardinas-Patterson (SP) test ([5]) to
the case of our interest. Given a set of codewords, the SP
test establishes in a finite number of steps whether the code
is uniquely decodable for unconstrained sequences. Here we
modify the classic algorithm for the case of constrained ones.
The generalization is straightforward, so we do not give
a formal proof here, as it would merely be a rewriting of that
for the classic SP test, for which we refer the reader to [6, th.
2.2.1].

Suppose our source symbol set is X = {1, 2, ..., n} and
let us call $W = \{W_i\}_{i=1,\ldots,n}$ the set of associated codewords.
For i = 1, 2, ..., n we call $F_i = \{W_j \mid P_{ij} > 0\}$ the subset of
W containing all codewords that can follow $W_i$ in a source
sequence. We construct a sequence of sets $S_1, S_2, \ldots$ in the
following way. To form $S_1$ we consider all pairs of codewords
of W; if a codeword $W_i$ is a prefix of another codeword $W_j$,
i.e. $W_j = W_i A$, we put the suffix A into $S_1$. In order to
consider only the possible sequences, we have to remember
the codewords that have generated every suffix; thus, we mark
the obtained suffix A with two labels, and indicate it with
${}_i A_j$. We do this for every i and j. Then,
for n > 1, $S_n$ is constructed by comparing elements of $S_{n-1}$
and elements of W; for a generic element ${}_i B_m$ of $S_{n-1}$ we
consider the subset $F_i$ of W:

a) if a codeword $W_k \in F_i$ is equal to B, the algorithm
stops and the code is not decodable;

b) if B is a prefix of a codeword $W_r \in F_i$, i.e. $W_r = BC$,
we put the labelled suffix ${}_m C_r$ into $S_n$;

c) if instead a codeword $W_s \in F_i$ is a prefix of B, i.e.
$B = W_s D$, we place the labelled suffix ${}_s D_m$ into $S_n$.

The code is uniquely decodable if and only if item a) is never
reached. Note that the algorithm can be stopped after a finite
number of steps; there are in fact only a finite number of
possible different sets $S_i$, and so the sequence $S_i$, i = 1, 2, ...
is either finite or periodic. We note that the code is finite delay
uniquely decodable if the sequence $S_i$ is finite, and infinite
delay uniquely decodable if the sequence is periodic.

Fig. 3. Two examples of transition graphs for which $\rho(Q) \le 1$ is a sufficient
condition.
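
The procedure above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: labelled suffixes are represented as tuples `(i, suffix, m)`, the function name `sp_test` and the returned strings are choices made here, and codewords are assumed to be non-empty. The four-symbol follower sets `FOLLOW4` are hypothetical, chosen only so that the sets $S_n$ repeat; they are not claimed to be a graph from the figures.

```python
def sp_test(codewords, followers):
    """Constrained Sardinas-Patterson test.

    codewords: dict symbol -> bit string.
    followers: dict symbol -> set of symbols that may follow it (the sets F_i).
    Returns "not decodable", "finite delay" or "infinite delay".
    """
    current = set()
    for i, wi in codewords.items():
        for j, wj in codewords.items():
            if i == j:
                continue
            if wi == wj:                      # two symbols share a codeword
                return "not decodable"
            if wj.startswith(wi):             # W_j = W_i A: store the labelled suffix
                current.add((i, wj[len(wi):], j))
    seen = set()
    while current:
        key = frozenset(current)
        if key in seen:                       # the sequence of sets is periodic
            return "infinite delay"
        seen.add(key)
        nxt = set()
        for i, b, m in current:
            for r in followers[i]:
                w = codewords[r]
                if w == b:                    # item a): two parsings realign
                    return "not decodable"
                if w.startswith(b):           # item b): W_r = B C
                    nxt.add((m, w[len(b):], r))
                elif b.startswith(w):         # item c): B = W_s D
                    nxt.add((r, b[len(w):], m))
        current = nxt
    return "finite delay"


CODE3 = {"A": "0", "B": "1", "C": "01"}
CONSTRAINED = {"A": {"A", "C"}, "B": {"A", "B", "C"}, "C": {"A", "B", "C"}}
FREE = {s: {"A", "B", "C"} for s in CODE3}

# Hypothetical follower sets for the codewords 0, 1, 01, 10 (an assumption).
CODE4 = {"A": "0", "B": "1", "C": "01", "D": "10"}
FOLLOW4 = {"A": {"C", "D"}, "B": {"C", "D"}, "C": {"B", "C"}, "D": {"A", "D"}}

print(sp_test(CODE3, CONSTRAINED))  # finite delay
print(sp_test(CODE3, FREE))         # not decodable
print(sp_test(CODE4, FOLLOW4))      # infinite delay
```

Tracking whole sets $S_n$ rather than a single pool of visited suffixes keeps the sketch close to the description in the text and lets it distinguish the finite and periodic cases.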

As an example of the SP test for constrained sequences we
consider the transition graphs shown in fig. 3. For both cases
we use codewords 0, 1, 01 and 10 for A, B, C and D
respectively. For the graph of fig. 3(a) we obtain
$S_1 = \{{}_a 1_c, {}_b 0_d\}$, $S_2 = \emptyset$. Thus the code is finite
delay uniquely decodable, and we can indeed verify that we need
to wait at most two bits for decoding a symbol. For the graph of
fig. 3(b), instead, we have $S_1 = \{{}_a 1_c, {}_b 0_d\}$,
$S_2 = \{{}_c 0_d, {}_d 1_c\}$, $S_3 = S_2$ and then $S_i = S_2$ for
every other i > 3. So, the code is still uniquely decodable but
with infinite delay; in fact it is not possible to distinguish the
sequences ADDD··· and CCC··· until they are finished, so that
the delay may be as long as we want.
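
The infinite delay can be seen directly on the bit strings: with the codewords above, the encoding of A followed by n copies of D is always a proper prefix of the encoding of n + 1 copies of C. A small sketch (the helper names are illustrative):

```python
CODE = {"A": "0", "B": "1", "C": "01", "D": "10"}


def encode(seq):
    """Concatenate the codewords of the symbols in seq."""
    return "".join(CODE[s] for s in seq)


def competing_encodings(n):
    """Encodings of A D^n and of C^(n+1) for a given n >= 1."""
    return encode("A" + "D" * n), encode("C" * (n + 1))


for n in range(1, 5):
    ad, cc = competing_encodings(n)
    print(ad, cc, cc.startswith(ad))
```

The two strings differ only in the final bit of the longer one, so a decoder that has seen the shorter string cannot yet tell which of the two sequences is being sent.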

V. Entropy rate and average lengths 

In the second section we have seen an example of a Markov
chain that is efficiently encoded with a non prefix-free code.
Here we note that the same thing happens with the sources of
fig. 3 with the indicated codewords if the associated transition
probability matrices are

[eq. (12): transition probability matrices for the graphs of fig. 3(a) and fig. 3(b)]

respectively, and the initial state is uniformly distributed. For
these sources the entropy of the first k symbols is (3k + 1)/2
while our code uses on average 3k/2 bits. Thus we cannot
say that, even for stationary ergodic processes$^3$, the minimum
average length of the code for the first k symbols is greater
than or equal to their entropy. The only thing we can say is
about the entropy rate.

Given a Markov source $X_1 X_2 X_3 \cdots$ with transition matrix
P, let $\mu = [\mu_1, \mu_2, \ldots, \mu_n]$ be the stationary distribution,
and $\mathcal{H}$ the entropy rate. Using the asymptotic equipartition
property for ergodic sources (Shannon-McMillan theorem,
[7]), we deduce that every uniquely decodable code
must use at least $\mathcal{H}$ bits per symbol on average, and thus
$\mathbf{l} \cdot \mu^T \ge \mathcal{H}$. Note that this does not imply that
$E[l(X_1)] \ge H(X_1)$, nor, in general, that for some fixed k we have
$E[l(X_1, X_2, \ldots, X_k)] \ge H(X_1, X_2, \ldots, X_k)$ (see [8, Th.
5.4.2]), and our examples precisely show that in fact this is
not true. The point is that for sources with memory we can
use codes that do not respect the classic Kraft inequality, and
thus the inequality $E[l(X)] \ge H(X)$ cannot be proved. Thus,
we should be careful in formulating the converse theorem for
variable length codes.
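
As a concrete check, consider the source of Section II with the codewords 0, 1 and 01. The lines below use only quantities stated there (uniform stationary distribution, conditional entropy of 4/3 bits), with logarithms taken in base 2; the decimal values are rounded.

```latex
\begin{align*}
\mathbf{l} \cdot \mu^T &= \tfrac{1}{3}\cdot 1 + \tfrac{1}{3}\cdot 1 + \tfrac{1}{3}\cdot 2
  = \tfrac{4}{3} = \mathcal{H}, \\
E[l(X_1)] &= \tfrac{4}{3} < \log 3 \approx 1.585 = H(X_1), \\
H(X_1, \ldots, X_k) - E[l(X_1, \ldots, X_k)]
  &= \log 3 + \tfrac{4}{3}(k-1) - \tfrac{4}{3}k
  = \log 3 - \tfrac{4}{3} \approx 0.252 .
\end{align*}
```

The bound on the rate holds with equality, while the bound for finite k fails by a constant of about a quarter of a bit that does not grow with k.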

In any case, the examples shown in this and in the preceding
sections leave many open questions. We have shown examples
of transition graphs with associated codeword lengths such
that the obtained code is uniquely decodable. For every one
of these graphs we have shown that there is a transition
probability matrix P such that the entropy rate of the source
exactly equals the average length per symbol of our code. We
should ask if this is only a coincidence or if it is the rule.
Furthermore, we may ask what happens in terms of efficiency
with our code for a generic matrix P on a given graph; is
it possible to find an optimal codeword assignment as we do
with Huffman codes in the classic framework? Again, what
happens if we consider blocks of more than one symbol? We
clearly obtain a different transition graph and the efficiency
cannot be lower; but is there a gain in some cases or not?

VI. Some considerations 

It is interesting to consider the proposed coding approach
from the point of view of the encoding complexity. Consider
for example the stationary Markov chain with transition matrix
of eq. (12) and uniform initial distribution. Note that it is
possible to encode the first k symbols of the source using
exactly as many bits as the entropy of those symbols using
Huffman codes. Two bits are used for the first symbol, which
corresponds to its entropy. Then, at every step we use a different
Huffman code, depending on the preceding symbol, and use
on average $H(X_{k+1}|X_k) = \mathcal{H} = 3/2$ bits. Note that this
technique actually fits with the conditional entropy idea, in the
sense that it really encodes each symbol given the preceding
one. This implies that the encoder must track the state of
the source and choose the code for each new symbol. On the
contrary, the non prefix-free codeword assignment indicated
in fig. 3 allows a very simple encoding phase, as there is
a fixed mapping from symbols to code bits, with the same
(slightly better) compression performance. The point is that
we are making a different use of the decoder's knowledge
about possible transitions. Note that, even for the Huffman
code, we are supposing that the decoder knows exactly which
transitions are possible and which are not. The difference is
that with the non prefix-free code we are making the decoder
more active. This relates the presented idea to other established
coding paradigms. We should note, in fact, that in practice the
proposed approach has already been used in other contexts. One
of the oldest examples may be that of modulo-PCM codes
([9]) for numerical sequences; here only the modulo-4 value
of every sample is encoded, leaving to the decoder the task
of recovering the original value using its knowledge of
the memory of the source. In that case the code used is even
a singular code$^4$, but under certain hypotheses this does not
affect decodability. Similar ideas are then used in the
recently reemerged theme of distributed source coding (see
for example [10] and references therein). Let us consider
for a moment the problem of noiseless separate coding of
dependent sequences. The well known Slepian-Wolf theorem
([11]) says that two correlated memoryless sources, X and
Y, can be separately losslessly coded at rates R(X) and R(Y)
respectively when jointly decoded, if $R(X) \ge H(X|Y)$,
$R(Y) \ge H(Y|X)$ and $R(X) + R(Y) \ge H(X,Y)$. Cover
extended the result to the case of general ergodic sources in
[12]. Roughly speaking, the encoding process used at every
encoder consists in considering large blocks of symbols; the
set of all such blocks is split into disjoint bins and only the
index of the bin that contains the extracted block is encoded.
At the decoder the original block for both sequences X and Y
is recovered by extracting from the pair of specified bins the
only pair of jointly typical blocks. It is interesting to note that
this encoding technique actually uses singular codes in order
to achieve compression, leaving to the decoder the task of
disambiguating, based on the joint statistics of the two sources.

$^3$ Note that in this paper we have considered only stationary processes; for
nonstationary processes there may be still more surprising results.
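
The modulo-PCM idea mentioned above can be illustrated with a toy sketch. It is not the scheme of [9]: the source model (consecutive integer samples differing by at most 1), the choice of sending the first sample in full, and the function names are all assumptions made here to show how a singular code can remain decodable when the decoder exploits the memory of the source.

```python
def mod_pcm_encode(samples):
    """Keep the first sample in full, then only the residue modulo 4 of each sample."""
    return samples[0], [x % 4 for x in samples[1:]]


def mod_pcm_decode(first, residues):
    """Recover the samples, assuming |x[n] - x[n-1]| <= 1 (assumed source memory)."""
    out = [first]
    for r in residues:
        prev = out[-1]
        # Pick the value in {-1, 0, 1, 2} congruent to r - prev modulo 4;
        # under the assumption above it is the true increment.
        delta = (r - prev + 1) % 4 - 1
        out.append(prev + delta)
    return out


samples = [10, 11, 11, 10, 9, 9, 10, 11, 12, 11]
first, residues = mod_pcm_encode(samples)
print(mod_pcm_decode(first, residues) == samples)  # True: the hypothesis holds

# When the hypothesis is violated the singular code becomes ambiguous.
print(mod_pcm_decode(*mod_pcm_encode([0, 5])))     # [0, 1], not [0, 5]
```

The encoder is a fixed, memoryless mapping; all the work of resolving the ambiguity is moved to the decoder, which is the shift of complexity discussed in this section.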

The same idea is now being used, in order to shift the
complexity from the encoder to the decoder, in what may
be called "source coding based on distributed source coding
principles". Special attention in this field is being paid to
the case of video coding (see for example [13], [14]). In this
context the memory of the video source is not exploited in
the encoding phase but in the decoding one. Again, roughly
speaking, in the encoding phase no motion compensation or
prediction is applied, and singular codes are used for the data
compression task. In the decoding phase, on the contrary,
the memory of the source is exploited in order to remove
ambiguity by using motion compensation. It is interesting
to see that practical architectural specifications for these video
coding techniques have been presented only in recent years,
even if the idea of such an approach to video coding was
already patented by Witsenhausen and Wyner in the late '70s
([15]). Moreover, it is also interesting to note that there is
not much difference between this approach and the idea
behind the modulo-PCM coding mentioned above.

$^4$ In our examples here we have considered only non singular codes with
singular extension. The non singularity of the code is required in our setting,
where we want to be able to decode a sequence composed of one single
symbol.

As a final comment on the general problem of finding 
variable length non prefix-free codes for a given constrained 
sequence, we note that there are many connections with the 
area of coding for constrained channels (see for example [16]). 
The relation between the transition graphs of our sources 
and the possible codeword assignments may find interesting 
counterparts (and maybe answers) in that field, where graph 
theory has already been successfully applied. 